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Abstract 

Many fundamental questions concerning the emergence and subse- 
quent evolution of eukaryotic exon-intron organization are still unset- 
tled. Genome-scale comparative studies, which can shed light on cru- 
cial aspects of eukaryotic evolution, require adequate computational 
tools. 

We describe novel computational methods for studying spliceoso- 
mal intron evolution. Our goal is to give a reliable characterization of 
the dynamics of intron evolution. Our algorithmic innovations address 
the identification of orthologous introns, and the likelihood-based anal- 
ysis of intron data. We discuss a compression method for the evalu- 
ation of the likelihood function, which is noteworthy for phylogenetic 
likelihood problems in general. We prove that after 0(n£) preprocess- 
ing time, subsequent evaluations take 0(n£/ log £) time almost surely 
in the Yule-Harding random model of n-taxon phylogenies, where £ is 
the input sequence length. 

We illustrate the practicality of our methods by compiling and 
analyzing a data set involving 18 eukaryotes, more than in any other 
study to date. The study yields the surprising result that ancestral 
eukaryotes were fairly intron-rich. For example, the bilaterian ancestor 
is estimated to have had more than 90% as many introns as vertebrates 
do now. 
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1 Introduction 



Typical eukaryotic protein-coding genes contain introns, which are removed 
prior to translation. Key constituents of the spliceosome, which is the RNA- 



protein complex that performs the intron excision, can be traced back ( |Collins and Penny 2005[ ) 
to the last common ancestor of extant eukaryotes (LECA). Even deep-branching 
lineages (IVanacova et al. 2005| INixon et al. 20021) have introns. It is thus 
almost certain that spliceosomal introns were present in LECA. Moreover, 



when comparing distant eukaryotes, intron positions often agree (Rogozin et al. 2003) 



The similarity is likely due more to conservation of early introns than to 
parallel gains dRoy and Cilbert 2005[ ISverdlov et al. 2005|) . It is thus com- 
pelling to use genome-scale comparisons to study intron evolution in different 
lineages, and even to estimate the exon-intron organization in extinct ances- 



tors. One of the first such studies, by Rogozin et al. (2003) , involved orthol 



ogous gene sets in eight eukaryotes. The same data set was reanalyzed by 
different authors ( |Roy and Cilbert 2005[ ICsuros 20051 ICarmel et al. 20051 
Nguyen et al. 2005] ), using novel methods developed for intron data. Subse- 
quent inquiries (INielsen et al. 2004( |Roy and Penny 2006| |Roy and Penny 2007 



Coulombe- Huntington and Majewski 2007) attest to a renewed interest in 



understanding the specifics of intron evolution within different eukaryotic 
lineages. This paper introduces novel computational techniques for the anal- 
ysis of spliceosomal intron evolution, anticipating more large-scale studies to 
come. 

Section [2] describes an alignment method for intron-annotated protein se- 
quences, as well as a segmentation method for identifying conserved portions 
of a multiple alignment. Section [3] describes a likelihood framework in which 
intron evolution can be analyzed in a theoretically sound manner. Section H] 
scrutinizes a compression technique that accelerates the evaluation of the like- 
lihood function. The compression involves an 0(n£)-time preprocessing step 
for I sites and a phylogeny with n species. We show that the subsequent eval- 
uation takes sublinear, 0(n£/ log t) time almost surely in the Yule-Harding 
model of random phylogenies, even in the case of arbitrary, constant-size 
alphabets. Fast evaluation is particularly important when the likelihood is 
maximized in a numerical procedure that computes the likelihood function 
with many different parameter settings. Section \5\ describes two applications 
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Figure 1: Fragment of a multiple alignment before and after realignment us- 
ing intron annotation. Shaded rectangles show the intron positions projected 
to the protein sequences. 



of our methods. In one application, intron-aware alignment was used to vali- 
date some unexpected features of intron evolution before LECA. In a second 
application, we analyzed intron evolution in 18 eukaryotic species. We found 
evidence of intron-rich early eukaryotes and a prevalence of intron loss over 
intron gain in recent times. 



2 Identification of orthologous introns 
2.1 Intron-aware alignment 

Orthologous introns can be identified by using whole genome alignments in 



the case of closely related genomes ( Coulombe- Huntington and Majewski 2007) 



For distant eukaryotes, however, intron orthology can only be established 



through protein alignments (Rogozin et al. 2005). The usual procedure is to 



project intron positions onto an alignment of multiple orthologous proteins. 
If introns in different species are projected to the same alignment position, 
then the introns are assumed to be orthologous. 

Intron annotation can be included in protein alignments by defining in- 
tron match and mismatch scores. The alignment score is then computed as 
the sum of scores for amino acid matches, gaps, and intron matches. Incorpo- 
rating intron annotation should lead to better alignments at the amino acid 
level, and to a more reliable identification of orthologous introns. Figure [1] 
shows an example of such improvement. 

Intron scoring can be based on log-likelihood ratios in a probabilistic 
model (IDurbin et al. 19 98). The model is defined by the joint distribution p 



for the intron state in two sequences S and T, and the prior distributions tts 
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and ttt- Aligned sites have states (s,t) with probability its( s ) ' ttt(^) if the 
two sites are unrelated, or with probability p(s, t) in case of homology. An 
(s, t) alignment is scored with a value that is proportional to — log ^^^m ■ 

We used the data set of Rogozin et al. (2003) to assess the strength of 
intron-match signals. Since the data include no sites in which all species 
lack introns, but the model does allow for that, we added extra sites with 
no introns. The original data comprise 7236 intron sites in 684 genes, across 
8 species. Using a method described earlier (ICsuros 20051) . we added 35000 
unobserved intron sites. Using estimates for p and n, we computed the 
appropriate scores. The intron score is asymmetric and varies with evolu- 
tionary distance and intron conservation. Matches for absent introns have 
an insignificant score, but shared introns have a high score, such as 93 
(human-Plasmodium), 106 (human- Arabidopsis) , 152 (human- S. pombe) or 
303 (Drosophila- Anopheles) on a 1/60-bit scale. Shared introns thus give a 
signal comparable to amino acid matches: in the 1/60-bit scaled version of 
the VTML240 matrix (IMuller et al. 20021) . a tryptophan match scores 289, 
and an arginine identity scores 113. 

Consider the case of aligning two protein sequences, S and T, which are 
annotated with the intron positions. Every residue has two associated intron 
sites (after the first and second nucleotides of their codons), and there is 
an intron site between consecutive amino acid positions (phase-0 introns). 
Intron sites may or may not be filled in by introns in either sequence. We 
use the notation for S[i: 0] for the phase-0 site preceding the codon for amino 
acid i, and S[i: 1], S[i: 2] for phase-1 and -2 sites within the codon. Intron 
presence is encoded by 1, and intron absence is encoded by throughout the 
paper. The intron annotation is specified by the variables S[i: j] G {1,0}. 
(There can be no introns after the last amino acid.) Scores for aligned introns 
are specified by a 2 x 2 scoring matrix A. 

Phase-1 and phase-2 intron sites are automatically placed by their asso- 
ciated amino acids. If M is the amino acid scoring matrix, then the align- 



ment of S[i] and T[j] entails a score of M 
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Similarly, aligning S[i] with an indel implies a score 



of M 



S[i 
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, in addition to possi- 



ble gap opening and closing penalties. Standard alignment procedures need 
to be modified to deal with phase-0 introns, since the placement of phase-0 
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introns is not fixed with respect to gaps. It is not possible to simply add 
a new character to the alphabet to represent phase-0 introns because they 
affect gaps differently from amino acids. 

We added intron scoring into a multiple alignment framework, using a 
sum-of-pairs scoring policy with affine gap-scoring. It is NP-hard to find 
the alignment of two multiple alignments under these optimization crite- 
ria (IMa et al. 2003|) . Even the alignment of a single sequence to a multiple 
alignment necessitates sophisticated techniques ( Kececioglu and Zhang 19 98). 
Our solution therefore uses a gap-counting heuristic: namely, a gap-open 
penalty is triggered for an indel aligned with an amino acid if the indel is 
preceded by an amino acid or a phase-0 intron. Gap opening thus corre- 
sponds to a pattern ^ or Here, 1,0 are intron states for the phase-0 
site, and ? denotes either state. In addition, x denotes an arbitrary amino 
acid, - is the indel character, and * is either of the latter two. We imple- 
mented affine gap-scoring by separate gap-open and -close penalties, so that 
gaps at the alignment extremities can be penalized less severely. Gap closing 
is counted for the patterns ~, and ■ 

Table [T] gives the recurrences for a dynamic programming algorithm that 
aligns an intron-annotated sequence S to a multiple alignment P of h intron- 
annotated protein sequences. In order to simplify the presentation, we rep- 
resent the sequences in such a way that every odd position of S and P is 
a regular residue or alignment column, annotated with information on the 
presence of phase-1 and phase-2 introns, whereas every even position is a 

to denote the sum of intron-match scores 



5[i] 
P\0\ 



phase-0 intron site. We use A 

for the intron sites associated with the positions S[i), P\j\. We use also the 

shorthand M y to denote scoring for the alignment of an amino acid x with 

an amino acid profile y. The algorithm uses three types of variables, A[i, j], 
gS[i, j] and gP[i, j] , which correspond to partial prefix alignments ending with 
aligned residues, gaps in S, or gaps in P, respectively. In case of gS[z, j], the 
last indel must be aligned with an amino acid column, and, thus j must be 
odd; for gP[z, j], i must be odd. 

Gaps are scored by using affine penalties, with gap-open, -extend, and 
-close scores, denoted by 7^, 7^, 7^ ■ The gap-counting heuristic implies 
that gap scores in the equations of Table [T] are defined by the number of 
certain patterns in up to three consecutive alignment columns. For instance, 
72 CO equals the gap-open penalty multiplied by the number of such rows 
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Table 1: Recurrences for intron-aware alignment. Odd positions correspond 
to regular amino acids and possibly phase-1 and -2 introns. Even positions 
are placeholders for phase-0 intron sites. 
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7^ 1- or xO- 7^ 1- 

7g x 74 x if S[i] has intron, otherwise nothing 

7^ 1 or xO 7^ 1 

(>) n (>) 
7-j; -Ox or -1* 72 x 

73 1* or Ox 74 x if S[i] has intron, otherwise nothing 

7i-' - 4- ] - 

Table 2: Patterns for gap counting. The index j in 7^ < - ) (j), J^~'{j) 
and 7 ( - > - ) (j) is the index for the last column in the pattern. 



in P where column j contains an indel, and column j — 1 has a phase-0 intron. 
The corresponding pattern is described as 1-. Table [2] lists the patterns for 
the gap-counting heuristic. 

Table [1] does not show the initialization of the variables, nor the fi- 
nal gap-counting: they employ a logic analogous to the recurrences. At 
the end of the algorithm, the best of A[|S|, |P|], gS[\S\, \P\], gP[\S\, \P\) is 
selected, and the actual alignment is found by standard traceback tech- 
niques (IDurbin et al. 19 98). 

We implemented the algorithm in Java. The program iteratively realigns 
one sequence at a time to the rest of the sequences in a multiple alignment. 
Instead of sequence- dependent intron match-mismatch scores, the implemen- 
tation uses only two parameters: one for intron conservation and another for 
intron loss/gain. 



2.2 Identification of conserved blocks 

In order to reliably identify orthologs, we need to be able to distinguish re- 
gions of the alignment that are highly conserved from those that are less 
well-conserved. In poorly conserved regions, we cannot confidently infer in- 
tron orthology. 



Zhang et al. (1999) proposed post-processing pairwise sequence alignments 
into alternating blocks that score above a threshold parameter a or be- 
low (—a). We attained a similar goal by adapting algorithmic technniques 



from Csuros (2004) , The procedure separates a multiple alignment into al- 



ternating high- and low-scoring regions. Using a complexity penalty a, a 
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segmentation with k high-scoring regions has a segmentation score of A — ka 
where A is the sum of scores for the aligned columns. Column scores are 
computed without gap-open and -close penalties. The best segmentation of 
an alignment of length i can be found in 0(£) time after the column scores 
are computed. 

The result of this computation is that the total of column scores in each 
selected high-scoring region will be greater than a. There may be small sub- 
regions of negative scores, but the total score of such a sub-region cannot be 
less than (—a). Conversely, unselected regions score below (—a) and cannot 
have sub- regions scoring above a. 



3 A likelihood framework 



3.1 Markov models of evolution 



We use a Markov model for intron evolution, as in previous studies (ICsiiros 2005P 



For the sake of generality, we describe the Markov model (ISteel 1994} IFelsenstein 2004j) 



over an arbitrary alphabet A of fixed size r = \A\. The intron alphabet is 
A = {0, 1}. A phylogeny over a set of species X is defined by a rooted tree T 
and a probabilistic model. The leaves are bijectively mapped to elements 
of X. Each tree node u has an associated random variable which is 

called its state or label, that takes values over A. The tree T with its pa- 
rameters defines the joint distribution for the random variables The 
distribution is determined by the root probabilities (V (a) : a G A^j and edge 

transition probabilities (p e { a —> b) : a,b <E A^J assigned to every edge e. The 
root probabilities give the distributon of the root state. Edge transition prob- 
abilities define the conditional distributions pj£(wi + i) = b £,(ui) = aj = 

Vu iUi+1 {. a ~ > Along every path away from the root, the node states form 
a Markov chain. The leaf states form the character^ = : u G X). An 
input data set (or sample) consists of independent and identically distributed 
(iid) characters: £ = (&: i — 1, . . . , £). 

In case of intron evolution, introns are generated by a two-state continuous- 
time Markov process with gain and loss rates A e , \i e > along each edge e. 
The edge length is denoted by t e . Using standard results (IRoss 1996[) . the 



transition probabilities on edge e with rates A e = A, /i e = /i and length t e = t 
can be written as p e (0 -> 1) = ^(l - e~* (A+M) ), p e (l 0) = j%r(l ~ 
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e -t(A+/^ anc ip e (o -> 0) = l-p e (0 l),p e (l -»• 1) = l-p e (l -> 0). In the 
absence of established edge lengths, we fix the edge length scaling in such a 
way that A e + /i e = 1. Independent model parameters are thus 7r(l), v e and t e 
for all edges e. It is important to allow for branch-dependent rates, since 
loss and gain rates vary considerably between lineages (IJeffares et al. 2006 



Roy and Gilbert 2006). 



In a maximum likelihood approach, model parameters are set by max- 
imizing the likelihood of the input sample. Let x = (xi, . . . ,xt) be the 
input data. Every Xi is a vector of n states, and we write Xi(u) to denote 
the observed state of leaf u. By independence, the likelihood of x is the 
product P{£ = x} = rijlPl^ — x i}- Each character's likelihood can be com- 
puted in 0{n) time, using a dynamic programming procedure introduced by 



Felsenstein (1981), The procedure relies on a "pruning" technique, which 
consists of computing the conditional likelihoods L u (a) for every node u and 
letter a. L u (a) is the probability of observing the leaf labelings in the subtree 
of u, given that u is labeled with a. The likelihood for the character x equals 
L{x) = P{£ = x} = J2aeA Tr(a)L root (a). 

Intron data are somewhat unusual in that an all-0 character is never 
observed: the input does not include sites in which introns are absent in 
all of the organisms. The uncorrected likelihood function is therefore mis- 
leading, as it underestimates the probability of intron loss. To resolve this 



difficulty, we employ a correction technique proposed by Felsenstein (1992) 



for restriction sites. (Csuros (2005) describes an alternative technique based 
on expectation maximization.) We compute the likelihood under the condi- 
tion that the input does not include all-0 characters. We use therefore the 
corrected likelihood L'(x) = 1 _^ n - ) , and maximize V = \\iL\xi). 



3.2 Ancestral events in intron evolution 

Our goal is to give a reliable characterization of the dynamics of intron evo- 
lution. In particular, we aim to give estimates for intron density in ancestral 
species, and for intron loss and gain events on the edges. Notice that the 
estimation method needs to account for ancestral introns even if they got 
eliminated in all descendant lineages. It is possible to do that with the help 
of conditional expectations, which fit naturally into a likelihood framework. 

For an observed character x, we define upper conditional likelihoods U u (a) 
so that U u (a)L u (a) = P{£ = x,£(u) = a}. Upper conditional likelihoods are 
computed with dynamic programming, from the root towards the leaves, 
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in 0(n£) time (ICsuros 20050 . even for non-binary trees. Similar computa- 
tions are routinely used in DNA and protein likelihood maximization pro- 
grams QAdachi and Hasegawa 1995} IGuindon and Gascuel 2003]) . Here we 
allow irreversible probabilistic models, which explains why U u (a) must be 
computed in a top-down fashion in ([I]). 

The posterior probability for the state at node u is 



gM(x)=p{e(w) = a|e = x} 



U u {a)L u {a) 



J2beAU u (b)L u (b) 
The posterior probabilities for state changes on an edge uv are 



X 



Uja)LJa) 



p uv (a -> b)L v (b) 



a')L,,(a' 



Now, the number of ancestral introns is estimated as the conditional ex- 



pectation N u = £oq[ u \o n ) + qT'izi)- The formula takes into consideration 
unobserved intron sites, by estimating their number £q 



» 



L(0 n ) 



as the 



^ 1-L(0 n ) 

mean of a negative binomial random variable. The number of intron state 
changes is estimated as N v (a — ► b) — £oqj^(O n ) + J2iQab( x i)- m particular, 
N v (l — > 0) is the number of introns lost, and N v (0 — > 1) is the number of 
introns gained on the edge leading to v. 

In order to compute U, we initialize U root (a) = 7r(a). On every edge wt>, 



U u (a)p uv (a 

a£A 



&) I] E a')L w (a') 

ujesiblings(i)) a'GA 

2^ t'uWi.Wp 7 Ar / a - 



:d 



4 Rapid computation of the likelihood 

There are many heuristics that accelerate likelihood-based phylogenetic re- 
construction (IFriedman et al. 2002t IGuindon and Gascuel 2003[) . which mostly 
concentrate on the exploration of the tree space. We propose an improve- 
ment to the evaluation of the likelihood function, which normally takes linear 
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time in the input size. Our evaluation method yields an 0(\og£) speedup for 
typical trees. We use it to optimize the parameters of intron evolution on 
a known tree, but the method is generally applicable to phylogenetic likeli- 
hood problems where the likelihood is numerically optimized. The key idea 
is that it is enough to carry out the pruning algorithm once for every dif- 
ferent labeling within a subtree. Different labelings within each subtree can 
be computed in a preprocessing step. Subsequent evaluations of the likeli- 
hood function with different model parameters are faster, and depend on the 
number of different labelings in the data set. 

Here we describe how the preprocessing step can be carried out in 0(n£) 
time. Secondly, we analyze the computational complexity of subsequent eval- 
uations, and show that an 0(\og£) speedup is achieved for almost all random 
trees in the Yule-Harding model. The latter analysis produces some novel 
concentration results on the number of subtrees with a fixed size k. 



To our knowledge, the closest idea to ours was articulated by Stamatakis et al. (2002) 
Specifically, they proposed identifying characters in which leaves in a sub- 
tree have identical labels. They reported that in benchmark experiments 
with nucleotide sequences, likelihood optimization was accelerated by 12- 
15% through this technique. The technique relies entirely on the fact that 
alignments of closely related sequences exhibit high levels of identity, and 
cannot be extended to non-identical labelings. 

4.1 Yule-Harding model 

The Yule-Harding distribution is encountered in random birth and death 



models of species and in coalescent models (IFelsenstein 20041) . and is thus 
one of the most adequate random models for phylogenies. In one of the 
equivalent formulations, a random tree is grown by adding leaves one by one 
in a random order. The leaves are first numbered by using a random uniform 
permutation of the integers 1,2, ... ,n. Leaves are joined to the tree in an 
iterative procedure. In step 1, the tree is just leaf 1 on its own. In step 2, the 
tree is a "cherry" with leaves 1 and 2. In each subsequent step i = ?>,... ,n, 
a random leaf Y,i is picked uniformly from the set {1, 2, . . . , i — 1}. The new 
leaf i is added to the tree as the sibling of Yi, forming a cherry: a new node 
is placed on the edge leading to Yj and i is connected to it. The resulting 



random tree in iteration n has the Yule-Harding distribution (Harding 1971 ) 
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Algorithm Compress 
for every leaf u do 



C3 



C2 



initialize a[l..r] <— // a[j] is the first occurrence of state j 
for i <— 1 do 



C5 



C4 



if a[xi(ii)] = then a[a;j(u)] <— i 
set <— a[xj(u)] 



C6 



for every non-leaf node u in a post-order traversal do 



Cll 



CIO 



C9 



C7 



C8 



initialize the map H : {1, . . . , i?} 2 {1, ...,£} as empty 
let v, w be the left and right children of u 
for i <— 1, . . . , t do 

if H(hi(v),hi(w)) = null then H(hi(v), hi(w)) <- i 

set <- H(hi(v),hi(w)) 



Figure 2: Computing the auxiliary arrays from which the multiplicities v u 
are obtained. 

4.2 Preprocessing 

The likelihood can be computed faster by first identifying the different Xi 
values, along with their multiplicity in the data. For large trees, the input 
typically consists of many different labelings, but for small trees with n < 
log r £), the compression is useful, since the number of different X; L values is 
bounded by r n < £. In order to exploit the benefits of compression, we extend 
it to every subtree. 

Definition 1. Define the multiset of observed labelings for every node u as 
follows. Let n' denote the number of leaves in the subtree rooted at u, and 
let u±, . . . ,u n > denote those leaves. Define v u (y) for every labeling y e A n 
of the leaves in the subtree of u as the number of times y is observed in the 
data: 



i=i k ' 
where x{-} denotes the indicator function that takes the value 1 if its argu- 
ment is true, otherwise it is 0. Define also the set of observed labelings 



The multisets of observed labelings are computed in the preprocessing 
step. The likelihood function is evaluated subsequently by computing the 
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Algorithm T^onTjKRTJHOOr) 


LI 


for pvprv lp^f 7/ Ho 


L2 


set -^^[ajja'] < — x{ a = a '} for a G &nd a' 6 ./I 


L3 


for every non-leaf node u in a post-order traversal do 


L4 


for every labeling y £ § u do 


L5 


let iii, ^2 be the children of u, and let j/i, 7/2 be their subtree labelings 


L6 


for every a E A 


L7 


Lu[y][a] <- n j= i,2Ea'eyiP«« 3 ( a ~> a ') L n 3 K] 


L8 


set logL <— 


L9 


for every labeling y G S ro ot do 


L10 


logL «- logL + i/ root (|/) • J2aeA log(7r(a)-L roo tMH) 


Lll 


return logL 



Figure 3: Computing the log-likelihood using the observed labelings. 



conditional likelihoods at each node u for the labelings of § u only, in 0(|S U |) 
time. 

In order to compute u u , we use a recursive procedure. It is important to 
avoid working with the 0(n)-dimensional vectors y of Definition [1] directly, 
otherwise the preprocessing may take superlinear time in n. For that reason, 
every labeling y e S u is represented by the index % for which x« is the first 
occurrence of y. Accordingly, we compute an auxiliary array hi(u), which 
stores the first occurrence of each labeling Xi in it's subtree. In particular, 
hi(u) = %' if il is the smallest index such that Xi(uk) = %i'{uk) for all k 
where Uk are the leaves in it's subtree. Figure [2] shows that the values hi(u) 
can be computed in a post-order traversal. After hi(u) are computed for all % 
and u, the multiplicities v u and observed labelings § u are straightforward 
to calculate in linear time. The map H in Lines IC7HC11I is sparse with at 
most I entries, and can be implemented as a hash table so that accessing 
and updating it takes 0(1) time. Consequently, Algorithm Compress takes 
0(n£) time. 

4.3 Evaluating the likelihood function 

After the preprocessing step, the conditional likelihoods are computed only 
for the different labelings within each subtree. Figure [3] shows the evaluation 
of the likelihood function. The running time for the algorithm is 0(s) where s 
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is the total number of different labelings within all subtrees: 



u 

If n u denotes the number of leaves in the subtree rooted at u, then § u has at 
most r nu elements. Hence, s is bounded as 

s < ^min{r n V}. (2) 

u 

Observe that by sheer number of arithmetic operations, it is always worth 
evaluating the likelihood function this way. The worst situation is that of 
a caterpillar tree (where every inner node has a leaf child). In that case, 
there are only a few (|_log r £J — 1) non-leaf nodes for which we can compress 
the data, and it is possible to construct an artificial data set in which there 
are i different labelings for n — 0(\og€) subtrees. Caterpillar trees, however, 
are rare in phylogenetic analysis. Typical phylogenies have fairly balanced 
subtrees ( lAldous 20011) . 



In what follows, we examine the bound of (T5]) more closely in the Yule- 
Harding model. The analysis relies on a characterization of the random 
number of subtrees with a given size, as expressed in Theorems [U and [2J 

Theorem 1. Consider random evolutionary trees with n leaves in the Yule- 
Harding model. Let C k denote the number of subtrees with k — 1, . . . , n — 1 
leaves in a random tree. The expected value of C k is 

EC k = 2n( \ - -i-Y (3) 



k k+1 



Trivially, KC n = 1. 



Proof. Heard (1992) derives Equation ([3]) by appealing to a Polya urn model. 



An equivalent result is stated by Devroye (1991) 



□ □ 

Theorem 2. For all e > 0, 

P{C* < EC k - e} < ef£; (4a) 

F{C k >EC k + e} <e~£. (4b) 
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Proof. Consider the random construction of the tree. Let Yj denote the 
random leaf picked in step i to which leaf % gets connected, for % = 3, 4, . . . , n. 
Each random variable Yi is uniformly distributed over the set {1, 2, . . . , i— 1}, 
and Y3, 14, . . . , Y n are independent. Moreover, they completely determine 
the tree T at the end of the procedure. Consequently, C& is a function 
of (Yi: i — 3, . . . , n): Ck = f(Y 3 , Y" 4 , . . . , Y n ). The key observation for the 
concentration result is that if we change the value of only one of the Yi in 
the series, then Ck changes by at most two. In order to see this, consider 
what happens to the tree T, if we change the value of exactly one of the Yi 
from y to y' . Such a change corresponds to a "subtree prune and regraft" 
transformation (IFelsenstein 2004|) . Specifically, the subtree Tj, defined as the 
child tree of the lowest common ancestor u of y and % containing the leaf i, 
is cut from T, and is reattached to one of the edges on the path from the 
root to y' . Now, such a transformation does not affect Ck by much. Notice 
that subtree sizes are strictly monotone decreasing from the root on every 
path. On the path from the root to u, subtree sizes decrease by the size r 
of Tj, and u disappears, contributing a change of +1, or (—1) to Ck- (At 
most one subtree of size r + k that contains Tj now has size k, and at most 
one subtree of original size k is not counted anymore: it may be it's subtree 
itself, or a subtree above it.) An analogous argument shows that regrafting 
contributes a change of +1, 0, or (—1). Hence, the function /(•) is such 
that by changing one of its arguments, its value changes by at most 2. As a 
consequence, McDiarmid's inequality ( (TT989) ) can be applied to bound the 
probabilities of large deviations for Ck- In particular, for all e > 0, 

p{/(y 3 , • • • , Y n ) - Ef(Y 3 , ...,Y n )<-e}< e-^l* 2 
where c 2 = J2i=3 c l an d 



Cj = max 

y3,—,yn,v,y' 



f{V3, ■ ■ ■ ,yi-i,y,y i+ i, ...,y n ) 

- fiya, ■ ■ -,yi-i,y',yi+i, --^v, 
2 

Since a < 2 for all z, P{C fc < EC fc - e} < e ~^^, implying Eq. (fia) . An 
identical bound holds for the right-hand tail of the distribution. □ □ 



Remark. The particular case of k = 2 was considered by McKenzie and Steel (2000) 



They showed that the distribution of C2 is asymptotically normal with mean n/3 
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n I r n£ n|§ ra ot| bound s 

~~8 7236 2 101304 1386 368 183 

18 8044 2 273496 19142 16764 1196 

47 5216 3 479872 309120 65743 10305 



Table 3: Effect of the compression on three data sets. The fourth column 
quantifies the direct evaluation method, the fifth column quantifies the effect 
of the compression restricted to the root, the sixth column corresponds to 
the bound of Eq. (j2J) and the seventh column gives the exact value of s. The 
first data set is from Rogozin et al. (2003), the second data set is the one 
analyzed here in §5.21 and the third one is an unpublished data set we have 
worked on, where an ambiguity character is included in the alphabet. 



and variance 2n/45. The result suggests that the best constant factor in the 
exponent of Eqs. pj is 45/8 for i = 2, instead of |. Rosenberg (2006) gave 



exact formulas for the variance of C^. He showed that the variance of Ck is 
(2 + o(l))pj-. The variance formulas were given earlier in a different context 



by Devroye (1991) (See also the discussion by Blum and Frangois (2005) ) 
The result suggests that by analogy with the cherries, the best constant fac- 
tor in the exponent is (l/4 + o(l))/c 2 . It is thus plausible that the probability 
is properly bounded by 1 — o(n\og~ 2 i) in Theorem [3] below. 



Theorem 3. With probability 1 — °[j^pr}j> the likelihood function can be 

evaluated for random trees in the Yule-Harding model in O (^^j time after 

an initial preprocessing step that takes 0(n£) time. Evaluating the likeli- 
hood function takes 0(n£) time in the worst case, and 0(n£ log -1 1) time on 
average. 

We need the following lemma for the proof of Theorem [3j 
Lemma 4. For all t > 4, YX=i k(k+i) < 7+1 ■ ^ or a ^t>l and r = 3, 4, . . . , 



k 



^k=l fcCfc+11 — 



fe(fc+l) — t+1 ' 



Proof. The proof is straightforward by induction in t. Notice that the right 
order of magnitude is YX=i k(k+i) = ® (^~ 2r *) tmt; we need a bound for all t. 
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Proof of Theorem^ The preprocessing (Fig. [2]) takes 0(n£) time as dis- 
cussed in §4.21 The evaluation of the likelihood function (Fig. [3]) takes 0(s) 
time. By Eq. 



s < ^C fc min{r fc ^ 



Llog^J n 

E c *r k + E 



(5) 



Let t = \}og r £\ . By Theorem [T] and Lemma H],if£>16orr>3, 

r k f 2n 

fc= - k(k + 1) + It+T 



Es < 2n E 



1 



< 2n- 



r 



t + 1 



+ 2n 



< 



4n£ 



t + 1 - t + 1 



which proves our claim about the average running time. Now, let e 
Plugging e into Theorem (2], we get that 



(6) 



2i(m) ' 



p 



Ct. — ECj 



> e ^ < 2 exp j 



n /l 



t t+1 



(7) 



Let £& denote the event that 



Ck — ECfc < e for k = 1, . . . , t, and let £ i+ i 



denote the event that 



E 



k=l+t 



C fc -EE 



k=l+t Ck 



n' fc=1 £ fc implies £ t+1 . By 

t+1 i 

Pf| £ fc > i-E^ > l-2texp! 



fc=i 



k=l 



<te. Since Eit Cjfc — 2n — 1, 



n /l 



where £& denotes the complementary event to £&. Now, n| =1 £fc also implies 
that s < j^t. Since the likelihood computation takes O(s) time, the theorem 



holds. 



□ 



□ 



Theorem [3] underestimates the actual speedups that the compression 
method brings about. Table [3] shows that compression results in a 50-500 
fold speedup of the likelihood evaluation in practice. Notice also that the 
constants hidden behind the asymptotic notation are quite different between 
the preprocessing and evaluation steps: costly floating point operations are 
avoided in the preprocessing step. It is important to stress that the theoret- 
ical analysis does not rely on similarities in the input data: Theorem [3] and 
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the bound of fl2]) hold for any sample of size £. Real-life data are expected 
to behave even better, as Table [3] illustrates. 

Even though the theorem holds for arbitrary alphabets, the compression 
is less effective for large alphabets (amino acids for example) in practice. 
For DNA sequences, however, it should still be valuable: we conjecture that 
compression would accelerate likelihood optimization by at least an order of 
magnitude. 

We implemented Algorithms Compress and LogLikelihood, along 
with likelihood maximization and the posterior calculations of §3.21 in a 
Java package. Likelihood is maximized by setting loss and gain parame- 
ters for the edges, by using mostly the Broyden-Fletcher-Goldfarb-Shanno 
method (IPress et al. 19971) whenever possible, and occasionally Brent's line 
minimization (IPress et al. 19 97) for each parameter separately. 



5 Applications 

5.1 Ancient paralogs 

In our first example, intron-aware alignment was used to reject a hypothesis 
about whether lack of intron sharing between homologous genes is due to 
poor protein alignments. 

We used intron-aware alignment in a study about ancient eukaryotic 
paralogs (ISverdlov et al. 20070 . In the study, 157 homologous gene fami- 
lies were examined across six species (A. thaliana, H. sapiens, C. elegans, 
D. melanogaster, S. cerevistE and S. pombe) . These families were notable be- 
cause they contained paralogous members in multiple eukaryotic species, but 
not in prokaryotes, and, thus, presumably underwent duplication in the lin- 
eage leading to LECA. Ancient paralogs within and across species share very 
few (in the order of a few percentages) introns. The finding is quite surpris- 
ing, as recent paralogs, resulting from lineage-specific duplications, and also 
orthologs between human and Arabidopsis agree much more in their intron 



sites ( |Rogozin et al. 20030 . Consequently, ancient paralogs either lacked in- 
trons at the time of their duplication, or their duplication involved removal 
of introns. 

In one of the data validation steps for the study, we used intron-aware 
alignment with very high intron match rewards. With larger rewards for 
intron matches, more introns line up in the alignment, but the protein align- 
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ment gets worse. Even by corrupting the protein alignment, we were not 
able to achieve intron sharing levels similar to that of human-Arabidopsis 
orthologs. The lack of intron agreement is therefore not an artifact of the 
protein alignments. 

More details of our study will be described in a forthcoming publica- 
tion fISverdlov et al. 2007p . 

5.2 Intron-rich ancestors 

We compiled a data set with 18 eukaryotic species to give a comprehensive 
picture of spliceosomal evolution among Eukaryotes. 

Species Abbreviation Assembly Source 



Homo sapiens 


Hsap 


36.2 


R 


Rattus norvegicus 


Rnor 


RGSC 3.4 


E 


Takifugu rubripes 


Trub 


FUGU 4.0 


E 


Danio rerio 


Drer 


Zv6 


E 


Drosophila melanogaster 


Dmel 


BDGP 4.3 


R 


Anopheles gambit 


Agam 


AgamP3 


R 


Apis mellifera 


Amel 


AMEL4.0 


R 


Ccenorhabditis elegans 


Cele 


WS160 


R 


CrEnorhabditis briggsas 


Cbri 


CB25 


W 


Saccharomyces cerevisice 


Seer 


2.1 


R 


Neurospora crassa OR74A 


Ncer 




R 


Schizo saccharomyces pombe 972h- 


Spom 


1.1 


R 


Ustilago maydis 521 


Umay 




R 


Cryptococcus neoformans v. n. JEC21 


Cneo 


1.1 


R 


Oryza sativa ssp. japonica 


Osat 


RAP 3 


R 


Arabidopsis thaliana 


Atha 


6.0 


R 


Plasmodium falciparum 3D7 


Pfal 


1.1 


R 


Plasmodium berghei str. ANKA 


Pber 




R 



Table 4: Data sources and species abbreviations. R: RefSeq release 20, E: 
Ensemble release 42, W: WormBase release 160. 

Data preparation 

Genbank flatfiles and protein sequences were downloaded from RefSeq (IPruitt et al. 2007j) 
and Ensembl ( IHubbard et al. 20071) . Exon-intron annotation was extracted 
from the Genbank flatfiles. The C. briggs(E protein sequences and genome 
annotation were downloaded from WormBase (IBieri et al. 20071) . and intron 
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Figure 4: Phylogeny for the 18 species and estimated intron counts at ances- 
tors. Ancestral nodes are identified by the numbers 1-16. Error bars denote 
95% confidence intervals computed from 1000 simulated data sets. We rooted 
the tree at node 16 (Bikonts) for computational purposes. 

annotation was extracted from the GFF annotation file. TableHJlists the data 
sources and the species abbreviations. We used the 684 ortholog sets, each 
corresponding to a cluster of orthologous groups, or KOG (ITatusov et al. 2 003). 



from the study of Rogozin et al. (2003) as "seeds" for compiling a set of puta- 



tive orthologs for our species. For each seed (consisting of homologous protein 
sequences for eight species), we performed a position-specific iterated BLAST 
search (lAltschul et al. 19971) . In case of Plasmodium species, we used three 
iterations against a database of all protozoan peptide sequences in RefSeq. 
We used an E-value cutoff of 10~ 9 for retaining candidates. Each candidate 
hit was then used as a query in a reversed position-specific BLAST (rpsblast) 
search (IMarchler-Bauer et al. 20070 against the KOG database. Candidates 
were retained after this point if they had the highest scoring hit (by rps- 
blast) against the same KOG as the KOG of the seed data, and they scored 
within 80% of the best such hit for the species. 

From the resulting set of paralogs, we selected a putative ortholog set 
in the following manner. For each KOG, we selected all possible triples of 
human- Arabidopsis-Saccharomyces paralogs and kept the triple that had the 
highest score in alignments built by MUSCLE (Edgar 2004). Alignment score 



was computed using the VTML240 matrix (IMiiller et al. 2002]) . Additional 
putative orthologs were added for one species at a time, by aligning each 
paralog individually to the current profile, and keeping the one that gave the 
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largest alignment score. At this iterative addition, scoring was done with 
the VTML240 matrix, by summing the five highest pairwise scores between 
a candidate and already included sequences. 

The resulting sets were then realigned using MUSCLE, and then realigned 
again using our intron-aware alignment with a gap penalty of 300, gap-extend 
penalty of 11, VTML240 amino acid scoring, intron-match scores of 300 
and intron-mismatch penalties of 20. (These latter were established using 
different intron-match and -mismatch scores, and selecting the ones that gave 
the fewest number of intron sites while decreasing the score of the implied 
protein alignment by less than 0.1%.) 

Conserved portions were extracted using our segmentation program with 
a complexity penalty of a = 400 (larger values gave identical segmentation 
results, and lower values resulted in too many scattered blocks) . We penalized 
indels with an infinitely large value to exclude gap columns. Phase-0 introns 
falling on the boundaries of conserved blocks were excluded. Intron presence 
and absence in the aligned data was then extracted to produce the data for 
the likelihood programs. 



Results 



Figure H] shows the estimated intron densities for ancestral species. It is 
notable that ancestral nodes such as the bilaterian ancestor (node 4), the 
ecdysozoan ancestor (node 5), the opisthokont ancestor (node 15), and the 
bikont ancestor (node 16) all have very high intron densities, surpassing 
most previous estimates ( Rogozin et al. 2003 ICsuros 20 05). The ecdyso- 
zoan ancestor has an even higher estimated intron density (80% of hu- 
man density) than the otherwise quite generous estimates (about 70%) of 



Roy and Gilbert (2005) , which is mostly due to the inclusion of the relatively 



intron- rich honeybee genes (IIHBSC 2006|) . Intron density in the bilaterian 
ancestor is estimated to be almost as high as in humans (94%), agreeing with 
estimates of Roy and Gilbert (2005) , Sequences of a handful of intron-rich 



genes from the marine annelid Platynereis dumerilii have already indicated 
that the bilaterian ancestor's genome was at least two-thirds as intron-rich as 
the vertebrate genomes even by conservative estimates (IRaible et al. 2005} 



Roy 2006| ). 

Introns are for the most part lost on the branches. Figure [5] shows the 
estimated changes in a few lineages. Intron evolution has been much slower in 
the vertebrate lineage than in most other lineages: in more than 380 million 
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Figure 5: Intron gains and losses in a few evolutionary paths. Estimated and 
actual intron counts are in parentheses. 

years since the divergence with fishes, only about 3% of our introns got lost. 
Fungi, for example, massively trimmed their introns in many lineages. A 
notable exception is C. neoformans, which seems to have gained introns, 
but that assessment may change if another basidiomycete genome becomes 
available besides the relatively intron-poor U. maydis. 



6 Conclusion 

We presented a novel alignment technique for establishing intron orthology, 
and a likelihood framework in which intron evolutionary events can be quanti- 
fied. We described a compression method for the evaluation of the likelihood 
function, which has been extremely valuable in practice. We also showed that 
the compression leads to sublinear running times for likelihood evaluation. 

We illustrated our methods for analyzing intron evolution with a large 
and diverse set of eukaryotic organisms. The data set is more comprehensive 
than any used in other studies published to this day. The data indicate that 
ancestral eukaryotic genomes were more intron-rich than previous studies 
suggested. 

Many circumstances influence intron loss (IJeffares et al. 2006} Roy and Gilbert 2006), 



and realistic likelihood models need to introduce rate variation (ICarmel et al. 2005) 



Usual rate variation models (IFelsenstein 20041) entail multiple evaluations of 



the likelihood function, and, thus, underline the importance of computational 
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efficiency. We believe that the proposed methods will help to produce and 
analyze large data sets even within complicated likelihood models. 
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